House Pricing Prediction

Luis Carlos Olivares Rueda
Applied Physicist

In this work we use regression models to predict house prices based on several elements, such as the number of bathrooms, the year of construction, and whether or not the house has a basement.

This work covers:

  • Data Cleaning
  • Data Engineering
  • Data Selection
  • Data Visualization
  • Regression Analysis

The dataset was taken from Kaggle.
The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

As we are dealing with house prices, we can expect that the year of construction, the total area inside the house, and the number of rooms and bathrooms will have a great impact on the price.
Because of this, we expect the feature selection and regression analysis to confirm that these are among the most important features.

Libraries and Important Functions

In [1]:
# Import all the necessary libraries

import numpy as np
import pandas as pd

from scipy.stats import norm, skew, kurtosis

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import statsmodels.graphics.gofplots as sm

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import BayesianRidge, LinearRegression, Lars
from sklearn.metrics import mean_squared_error, r2_score

import pycaret.regression as pr

import optuna

import matplotlib.pyplot as plt
import seaborn as sns
import scienceplots

plt.style.use(['notebook', 'grid', 'nature'])

First we define some important functions that we will be using throughout the project.

In [2]:
# missing_vals calculates the number, percentage, and data type of the missing values of each predictor in a pandas DataFrame

def missing_vals(df):
    
    # Number of missing values per predictor (only predictors with missing values), sorted descending
    
    na_counts = df.isna().sum()
    na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
    
    # Percentage of missing values
    
    percentage = na_counts / len(df) * 100
    
    # Data type of predictors with missing values
    
    dtypes = df[na_counts.index].dtypes.values
    
    # Collect information into an array
    
    data = np.array([dtypes, na_counts.values, percentage.values]).T
    
    # Transform the array into a pandas DataFrame
    
    return pd.DataFrame(data=data, index=na_counts.index, columns=['Dtypes', '#Missing Values', '%Missing Values'])
In [3]:
# skew_kurtosis calculates the skewness and kurtosis of each numeric predictor of a given dataset

def skew_kurtosis(df):
    
    # Extract numeric features
    
    numeric_features = df.dtypes[df.dtypes != 'object'].index
    
    # Calculate skewness and kurtosis
    
    skewness_vals = df[numeric_features].apply(skew, axis=0).values
    
    kurtosis_vals = df[numeric_features].apply(kurtosis, axis=0).values
    
    # Collect information into an array
    
    data = np.array([skewness_vals, kurtosis_vals]).T
    
    # Transform the array into a pandas DataFrame
    
    return pd.DataFrame(data=data, index=numeric_features, columns=['Skewness', 'Kurtosis'])
In [4]:
# compute_vif calculates the variance inflation factor (VIF) of each predictor of a given dataset

def compute_vif(df, considered_features):
    
    # Use only the selected features
    
    X = df[considered_features]
    X = add_constant(X)
    
    # Calculate the variance inflation factor of each predictor and store it in a pandas DataFrame
    
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif = vif[vif['Variable']!='const']
    
    return vif.sort_values(by=['VIF'], ascending=False)
In [5]:
# Given a dataset, a list of predictors and a maximum acceptable VIF (threshold), reduce_vif drops the predictor with the highest VIF, one at a time, until all remaining VIF values are below the threshold

def reduce_vif(df, threshold, considered_features):
    
    # Compute VIF
    
    vif = compute_vif(df, considered_features)
    discarded_features = []
    
    # On each iteration the VIF is recalculated and the predictor with the highest VIF is dropped, until the threshold is reached
    
    while vif.iloc[0, 1] > threshold:
        discarded_features.append(vif.iloc[0, 0])
        vif = compute_vif(df, considered_features.drop(discarded_features))
    
    return vif

Loading Data

In [7]:
# Import the dataset

df = pd.read_csv("data.csv")
df.head()
Out[7]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
In [8]:
# Drop the ID column

df.drop(['Id'], axis=1, inplace=True)
df.head()
Out[8]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
In [9]:
# Create a copy of the data

df1 = df.copy()

Missing Values

The number, percentage, and data type of the missing values are calculated.
Then a heatmap of the missing values is created.
In this problem, most of the missing values do not mean that the information is unknown; instead, they mean that the house lacks something, such as a pool or a basement. So we need to substitute the NaN values with something else.
We also need to use the data_description.txt file to know how to deal with each feature.

In [10]:
# missing values table

missing_vals(df1)
Out[10]:
Dtypes #Missing Values %Missing Values
PoolQC object 1453 99.520548
MiscFeature object 1406 96.30137
Alley object 1369 93.767123
Fence object 1179 80.753425
FireplaceQu object 690 47.260274
LotFrontage float64 259 17.739726
GarageType object 81 5.547945
GarageYrBlt float64 81 5.547945
GarageFinish object 81 5.547945
GarageQual object 81 5.547945
GarageCond object 81 5.547945
BsmtExposure object 38 2.60274
BsmtFinType2 object 38 2.60274
BsmtFinType1 object 37 2.534247
BsmtCond object 37 2.534247
BsmtQual object 37 2.534247
MasVnrArea float64 8 0.547945
MasVnrType object 8 0.547945
Electrical object 1 0.068493
In [11]:
# heatmap of the missing values

plt.figure(figsize=(20, 7))
sns.heatmap(df1.isna(), cbar=False)
plt.show()
In [12]:
# Based on the data_description.txt file, the missing values were imputed

# Here MasVnrArea, GarageArea and GarageYrBlt were filled with 0's

fill_zero = ['MasVnrArea', 'GarageArea', 'GarageYrBlt']
df1[fill_zero] = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df1[fill_zero])

# Here MSSubClass, YearBuilt, YearRemodAdd and other ordinal or date-like predictors were changed to the object data type

change_cat = ['MSSubClass', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold', 'OverallQual', 'OverallCond']
df1[change_cat] = df1[change_cat].astype(object)

# Here Alley, FireplaceQu, PoolQC and the other categorical features where NaN means "not present" were filled with the string 'None'

fill_none = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'MasVnrType']
df1[fill_none] = SimpleImputer(strategy='constant', fill_value='None').fit_transform(df1[fill_none])

# Electrical has only 1 missing value, so we can drop that row

delete_rows = ['Electrical']
df1.dropna(axis=0, subset=delete_rows, inplace=True)

# LotFrontage has some missing values; here we use a k-nearest-neighbors imputer with 5 neighbors to fill them

fill_num = ['LotFrontage']
knn_imputer = KNNImputer(n_neighbors=5)
df1[fill_num] = knn_imputer.fit_transform(df1[fill_num])
In [13]:
# missing values table

missing_vals(df1)
Out[13]:
Dtypes #Missing Values %Missing Values
In [14]:
# Create a copy of the data

df2 = df1.copy()

Feature Engineering

Here we create new variables that could be useful in the regression analysis.

In [15]:
# Square Feet per Room

df2["SqFtPerRoom"] = df2["GrLivArea"] / (df2["TotRmsAbvGrd"] + df2["FullBath"] + df2["HalfBath"] + df2["KitchenAbvGr"])

# Total Home Quality

df2['Total_Home_Quality'] = df2['OverallQual'] + df2['OverallCond']

# Total Bathrooms

df2['Total_Bathrooms'] = (df2['FullBath'] + (0.5*df2['HalfBath']) + df2['BsmtFullBath'] + (0.5*df2['BsmtHalfBath']))

# HighQualSF

df2["HighQualSF"] = df2["1stFlrSF"] + df2["2ndFlrSF"]
In [16]:
#Create a copy of the data

df3 = df2.copy()

Target Transformation

Several regression methods work better if the data is normalized. Here we calculate the skewness and kurtosis of the target (House Prices) and plot its distribution. As you will see, the original data is right-skewed, so we have to normalize it using the Yeo-Johnson transformation.
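As a side illustration (with synthetic data, not the project's), the sketch below shows how a fitted Yeo-Johnson transformation pulls the skewness of right-skewed data toward zero, using scipy's `yeojohnson`:

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Synthetic right-skewed data (log-normal), standing in for a price-like variable
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.6, size=1000)

# yeojohnson fits the transformation parameter lambda by maximum likelihood
transformed, lmbda = yeojohnson(prices)

print(f'skewness before: {skew(prices):.2f}')
print(f'skewness after:  {skew(transformed):.2f}')
```

The same effect is what the PowerTransformer below achieves on SalePrice.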

In [17]:
skew_kurtosis(df3[['SalePrice']])
Out[17]:
Skewness Kurtosis
SalePrice 1.880008 6.502799
In [18]:
# Histogram and Normal Probability Plot of the target (House Pricing)

fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(20, 7))

sns.histplot(df3['SalePrice'], stat='density', color='orange', ax=ax1)
mu, std = norm.fit(df3['SalePrice'])
xx = np.linspace(*ax1.get_xlim(),100)
ax1.set_title('Sales Price Distribution')
sns.lineplot(x=xx, y=norm.pdf(xx, mu, std), ax=ax1)

sm.ProbPlot(df3['SalePrice']).qqplot(line='s', ax=ax2)
ax2.set_title('Normal Probability Plot of Sales Price')

plt.show()
In [19]:
# Apply the Yeo-Johnson transformation to the target

target_transformer = PowerTransformer(method='yeo-johnson', standardize=False)

# Histogram and Normal Probability Plot of the transformed target (Normalized Target)

df3['Transformed_SalePrice'] = target_transformer.fit_transform(df3[['SalePrice']]).T[0]

fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(20, 7))

sns.histplot(df3['Transformed_SalePrice'], stat='density', color='orange', ax=ax1)
mu, std = norm.fit(df3['Transformed_SalePrice'])
xx = np.linspace(*ax1.get_xlim(),100)
ax1.set_title('Transformed Sales Price Distribution')
sns.lineplot(x=xx, y=norm.pdf(xx, mu, std), ax=ax1)

sm.ProbPlot(df3['Transformed_SalePrice']).qqplot(line='s', ax=ax2)
ax2.set_title('Normal Probability Plot of Transformed Sales Price')

plt.show()
In [20]:
# Drop the original target and create a copy of the data

df3.drop(['SalePrice'], axis=1, inplace=True)
df4 = df3.copy()

Features Transformation

Here we calculate the skewness and kurtosis of the predictors and normalize them with the Yeo-Johnson transformation only if $|skewness(x)|<2$ or $|kurtosis(x)|<7$.

In [21]:
# Skewness and kurtosis of the predictors

skew_kurtosis(df4.drop(['Transformed_SalePrice'], axis=1))
Out[21]:
Skewness Kurtosis
LotFrontage 2.382060 21.754015
LotArea 12.190881 202.402120
MasVnrArea 2.673798 10.095230
BsmtFinSF1 1.683465 11.079615
BsmtFinSF2 4.249219 20.023898
BsmtUnfSF 0.918367 0.466639
TotalBsmtSF 1.525190 13.232154
1stFlrSF 1.375089 5.724629
2ndFlrSF 0.813466 -0.554484
LowQualFinSF 8.998885 82.885802
GrLivArea 1.364297 4.868582
BsmtFullBath 0.594354 -0.841470
BsmtHalfBath 4.097541 16.322022
FullBath 0.037821 -0.858040
HalfBath 0.677275 -1.073973
BedroomAbvGr 0.211839 2.215847
KitchenAbvGr 4.482026 21.436776
TotRmsAbvGrd 0.676068 0.872000
Fireplaces 0.647913 -0.221309
GarageCars -0.341494 0.214062
GarageArea 0.179081 0.907592
WoodDeckSF 1.539362 2.974720
OpenPorchSF 2.361099 8.452397
EnclosedPorch 3.085342 10.381118
3SsnPorch 10.290132 123.147774
ScreenPorch 4.116334 18.356321
PoolArea 14.807992 222.344724
MiscVal 24.443278 698.121807
SqFtPerRoom 0.980318 2.875496
Total_Bathrooms 0.265074 -0.138523
HighQualSF 1.328266 4.853191
In [22]:
# Find the predictors with (abs(skew(x)) < 2) or (abs(kurtosis(x)) < 7)

skewed_values = skew_kurtosis(df4.drop(['Transformed_SalePrice'], axis=1))

threshold = (np.abs(skewed_values['Skewness']) < 2) | (np.abs(skewed_values['Kurtosis']) < 7)

skewed_values[threshold]
Out[22]:
Skewness Kurtosis
BsmtFinSF1 1.683465 11.079615
BsmtUnfSF 0.918367 0.466639
TotalBsmtSF 1.525190 13.232154
1stFlrSF 1.375089 5.724629
2ndFlrSF 0.813466 -0.554484
GrLivArea 1.364297 4.868582
BsmtFullBath 0.594354 -0.841470
FullBath 0.037821 -0.858040
HalfBath 0.677275 -1.073973
BedroomAbvGr 0.211839 2.215847
TotRmsAbvGrd 0.676068 0.872000
Fireplaces 0.647913 -0.221309
GarageCars -0.341494 0.214062
GarageArea 0.179081 0.907592
WoodDeckSF 1.539362 2.974720
SqFtPerRoom 0.980318 2.875496
Total_Bathrooms 0.265074 -0.138523
HighQualSF 1.328266 4.853191
In [23]:
# Transformation of the predictors with (abs(skew(x)) < 2) or (abs(kurtosis(x)) < 7)

skewed_features = skewed_values[threshold].index
skewed_features

parameter_transformer = PowerTransformer(method='yeo-johnson', standardize=False)

df4[skewed_features] = parameter_transformer.fit_transform(df4[skewed_features])
In [24]:
# Create a copy of the data

df5 = df4.copy()

Encoding

Categorical variables cannot be used as they are in regression models, so we need to encode them as numerical values.

Ordinal Encoding

Some of the categorical variables have an order that matters; for these variables we use an ordinal encoder.
The data_description.txt file is used to decide which predictors are encoded with the ordinal encoder.
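One detail to be aware of (an observation, not part of the original notebook): by default, sklearn's OrdinalEncoder assigns codes in alphabetical order of the labels, which for quality scales such as ExterQual (Po < Fa < TA < Gd < Ex in data_description.txt) does not match the true ranking. A minimal sketch of an explicit mapping that does preserve the order, on a hypothetical column:

```python
import pandas as pd

# Hypothetical quality ratings on the Po < Fa < TA < Gd < Ex scale
quality = pd.Series(['Gd', 'TA', 'Ex', 'Fa', 'Gd'], name='ExterQual')

# An explicit mapping keeps the intended ranking
quality_order = {'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}
encoded = quality.map(quality_order)

print(encoded.tolist())  # [3, 2, 4, 1, 3]
```

sklearn's OrdinalEncoder can achieve the same through its categories parameter, by passing the category lists in rank order.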

In [25]:
# Different predictors were encoded with the OrdinalEncoder() function from sklearn

ordinal_features = ['MSSubClass', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Functional', 'Fence', 'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'YrSold', 'MoSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Total_Home_Quality']
ordinal_encoder = OrdinalEncoder()
df5[ordinal_features] = ordinal_encoder.fit_transform(df5[ordinal_features])
In [26]:
# All numerical features were stored so we can standardize them later

standardize_features = df5.dtypes[df5.dtypes != 'object'].index
standardize_features = standardize_features[:-1]
In [27]:
# Create a copy of the data

df6 = df5.copy()

One-Hot-Encoding

Some of the categorical variables have no order that matters; for these variables we use a one-hot encoder.
The data_description.txt file is used to decide which predictors are encoded with the one-hot encoder.
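For intuition, here is a minimal sketch of one-hot encoding on a hypothetical nominal column, using pandas' get_dummies; sklearn's OneHotEncoder builds the same kind of indicator columns but also remembers the fitted categories so new data can be transformed consistently:

```python
import pandas as pd

# Hypothetical nominal feature with no meaningful order
roof = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat']})

# Each category becomes its own indicator column
dummies = pd.get_dummies(roof, columns=['RoofStyle'])

print(dummies.columns.tolist())
# ['RoofStyle_Flat', 'RoofStyle_Gable', 'RoofStyle_Hip']
```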

In [28]:
# OneHotEncoder from sklearn was used to encode several predictors

ohe_features = df6[df6.dtypes[df6.dtypes == 'object'].index].columns
ohe_encoder = OneHotEncoder(sparse=False, drop=None)
ohe_encoded = ohe_encoder.fit_transform(df6[ohe_features])
In [29]:
# This piece of code stores the names of the new variables created by the encoding

ohe_categories = []
counter = 0

for i in ohe_encoder.categories_:
    for j in i:
        counter += 1
        ohe_categories.append(j + str(counter)) 

# Original variables before encoding were deleted        
        
df6.drop(ohe_features, axis=1, inplace=True)
other_features = df6.columns.values
In [30]:
# The one-hot-encoded data was concatenated with the rest of the data

concatenated_data = np.concatenate((df6.values, ohe_encoded), axis=1)

transformed_data = pd.DataFrame(data=concatenated_data, columns=[*other_features, *ohe_categories])
In [31]:
# Create a copy of the data

df7 = transformed_data.copy()

Scaling

Several predictors need to be standardized; this helps both the data analysis and the later regression analysis.

In [32]:
# Standardization was performed with the StandardScaler function from sklearn

standard_scaler = StandardScaler()

df7[standardize_features] = standard_scaler.fit_transform(df7[standardize_features])
In [33]:
# Create a copy of the data

df8 = df7.copy()

Feature Selection

Not all the predictors will help us in the regression analysis; some may even cause problems.
First, we delete all the variables that are not correlated with the target (House Prices).
Then we delete the predictors that depend on other predictors, because we want to avoid multicollinearity.

Correlation Analysis

Predictors must be related to the target; if they are not, they can be harmful to the regression analysis or other techniques.

In [34]:
# Pearson correlation coefficient was used to determine the correlation between the predictors and the target (House-Prices)

corr = df8.corr(method='pearson')[['Transformed_SalePrice']].sort_values(by=['Transformed_SalePrice'], ascending=False)

The selected features satisfy $0.4 \leq |Pearson(x)| \leq 1$, where $Pearson(x)$ is the Pearson correlation between predictor $x$ and the target.

In [35]:
# Selection of the correlated features with the target

selected_features_1 = corr[(np.abs(corr['Transformed_SalePrice']) <= 1) & (np.abs(corr['Transformed_SalePrice']) >= 0.4)]
features_1 = selected_features_1.index[1:]
selected_features_1
Out[35]:
Transformed_SalePrice
Transformed_SalePrice 1.000000
OverallQual 0.815235
HighQualSF 0.736232
GrLivArea 0.729388
GarageCars 0.683783
Total_Bathrooms 0.676208
GarageArea 0.647242
Total_Home_Quality 0.645402
TotalBsmtSF 0.611456
1stFlrSF 0.607305
GarageYrBlt 0.602061
YearBuilt 0.599856
SqFtPerRoom 0.592996
FullBath 0.592231
YearRemodAdd 0.567088
TotRmsAbvGrd 0.538937
PConc123 0.530161
Fireplaces 0.511809
MasVnrArea 0.421457
Attchd139 0.419819
GarageFinish -0.414034
HeatingQC -0.425112
KitchenQual -0.526739
BsmtQual -0.572155
ExterQual -0.574202
In [36]:
# Create a copy of the data

df9 = df8[selected_features_1.index].copy()
In [37]:
# Correlation Heatmap of selected variables

plt.figure(figsize=(20, 7))
sns.heatmap(df9[features_1].corr())
plt.show()

Multicollinearity Analysis

Here we delete the predictors that depend on other predictors. We do this by calculating the VIF values of the data and dropping, one at a time, the variables that cause the VIF values to be high.
The maximum VIF value allowed is 5.
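For reference, the VIF of predictor $j$ is $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is obtained by regressing predictor $j$ on the remaining predictors. A NumPy-only sketch on a deliberately collinear toy dataset (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.95*x1 + 0.05*rng.normal(size=n)  # x3 is nearly a copy of x1

X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress column j on the other columns (with an intercept)
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ beta)**2).sum() / ((y - y.mean())**2).sum()
    return 1 / (1 - r2)

# x1 and x3 should show very large VIFs; x2 should stay near 1
print([round(vif(X, j), 1) for j in range(3)])
```

statsmodels' variance_inflation_factor, used in compute_vif above, implements the same formula.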

In [38]:
# VIF of the last data processed

compute_vif(df9, features_1)
Out[38]:
Variable VIF
3 GrLivArea 172.929295
2 HighQualSF 111.020176
15 TotRmsAbvGrd 22.159986
12 SqFtPerRoom 18.155202
1 OverallQual 7.166322
6 GarageArea 6.526810
4 GarageCars 5.715406
11 YearBuilt 5.358260
10 GarageYrBlt 5.153571
7 Total_Home_Quality 3.707992
9 1stFlrSF 3.407825
13 FullBath 3.311463
5 Total_Bathrooms 2.881309
8 TotalBsmtSF 2.805675
16 PConc123 2.627893
14 YearRemodAdd 2.562225
24 ExterQual 2.399325
23 BsmtQual 2.281713
22 KitchenQual 1.963198
21 HeatingQC 1.558463
20 GarageFinish 1.540274
19 Attchd139 1.531702
17 Fireplaces 1.518575
18 MasVnrArea 1.359740
In [39]:
# The maximum VIF value allowed is 5

selected_features_2 = reduce_vif(df9, 5, features_1)
features_2 = selected_features_2['Variable'].to_list()
selected_features_2
Out[39]:
Variable VIF
7 YearBuilt 4.748294
6 GarageYrBlt 4.138371
5 1stFlrSF 3.138479
4 TotalBsmtSF 2.702933
1 GarageCars 2.681481
9 FullBath 2.556512
12 PConc123 2.537507
2 Total_Bathrooms 2.488452
10 YearRemodAdd 2.407694
20 ExterQual 2.336097
19 BsmtQual 2.227776
18 KitchenQual 1.946509
11 TotRmsAbvGrd 1.910599
8 SqFtPerRoom 1.803938
3 Total_Home_Quality 1.774004
17 HeatingQC 1.546306
15 Attchd139 1.528695
16 GarageFinish 1.504329
13 Fireplaces 1.498017
14 MasVnrArea 1.344485
In [40]:
# Heatmap of correlation values

plt.figure(figsize=(20, 7))
sns.heatmap(df9[features_2].corr())
plt.show()
In [41]:
# Create a copy of the data

features_2.append('Transformed_SalePrice')
df10 = df9[features_2].copy()

Outlier Removal

Here we use an automatic outlier removal method: sklearn's LocalOutlierFactor(). It compares the distances between points to decide whether a point is an outlier (it is based on k-nearest neighbors).
The number of neighbors (points to be compared) chosen is 20.
The function assigns a score to each point, and we keep the data with a score greater than or equal to -1.4.
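A quick sketch (toy 2-D data, not the project data) of how the scores behave: negative_outlier_factor_ is close to -1 for inliers and much more negative for outliers, which is why a cut at -1.4 keeps the bulk of the data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# A tight cluster of 50 points plus one far-away point
points = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(points)
scores = lof.negative_outlier_factor_

# The isolated point gets by far the most negative score
print(f'outlier score: {scores.min():.2f} at index {scores.argmin()}')
```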

In [42]:
# LocalOutlierFactor was used and the negative outlier factors were stored in the dataset

outlier_detector = LocalOutlierFactor(n_neighbors=20)
outlier_detector.fit_predict(df10)
df10['NOF'] = outlier_detector.negative_outlier_factor_
In [43]:
# Only the data with NOF greater than or equal to -1.4 was selected

print('Original Shape', df10.shape)
df10 = df10[df10['NOF'] >= -1.4]
df10.drop(['NOF'], axis=1, inplace=True)
print('New Shape', df10.shape)
Original Shape (1459, 22)
New Shape (1387, 21)
In [44]:
# Create a copy of the data

df11 = df10.copy()

Splitting Data

50% of the data was used for training the regression models, 25% for testing and 25% for validation.

In [45]:
# train_test_split function from sklearn is used to split the dataset into train, test and validation sets

X = df11.drop(['Transformed_SalePrice'], axis=1)
y = df11['Transformed_SalePrice']

X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.5, random_state=86987)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=12345)

Model Selection

Several regression techniques exist and can be found in books, papers, and on the internet, but some libraries can help compare many models at once. For the model selection we will use the PyCaret library.

In [46]:
# Configuration of pycaret environment

_ = pr.setup(data=df11, target='Transformed_SalePrice', session_id=12345)
  Description Value
0 session_id 12345
1 Target Transformed_SalePrice
2 Original Data (1387, 21)
3 Missing Values False
4 Numeric Features 18
5 Categorical Features 2
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (970, 20)
10 Transformed Test Set (417, 20)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI 42a5
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target False
57 Transform Target Method box-cox

Here we can see that Bayesian ridge regression and simple linear regression were the two best methods, so we are going to use them in the regression analysis.

In [47]:
# Search for the best regression models

top3 = pr.compare_models(n_select=3)
  Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
br Bayesian Ridge 0.0385 0.0030 0.0534 0.8724 0.0061 0.0049 0.0100
lr Linear Regression 0.0385 0.0030 0.0534 0.8723 0.0061 0.0049 0.7910
ridge Ridge Regression 0.0385 0.0030 0.0534 0.8723 0.0061 0.0049 0.0060
lar Least Angle Regression 0.0385 0.0030 0.0534 0.8723 0.0061 0.0049 0.0060
huber Huber Regressor 0.0382 0.0030 0.0536 0.8718 0.0061 0.0049 0.0110
catboost CatBoost Regressor 0.0383 0.0030 0.0537 0.8707 0.0061 0.0049 0.9270
gbr Gradient Boosting Regressor 0.0391 0.0031 0.0552 0.8637 0.0063 0.0050 0.0500
lightgbm Light Gradient Boosting Machine 0.0406 0.0032 0.0555 0.8624 0.0063 0.0052 0.0410
rf Random Forest Regressor 0.0408 0.0034 0.0576 0.8522 0.0065 0.0052 0.1160
xgboost Extreme Gradient Boosting 0.0440 0.0036 0.0594 0.8431 0.0067 0.0056 0.1750
et Extra Trees Regressor 0.0429 0.0037 0.0605 0.8372 0.0069 0.0055 0.0920
par Passive Aggressive Regressor 0.0467 0.0038 0.0612 0.8333 0.0069 0.0060 0.0070
knn K Neighbors Regressor 0.0455 0.0041 0.0632 0.8223 0.0072 0.0058 0.0110
ada AdaBoost Regressor 0.0532 0.0049 0.0700 0.7823 0.0079 0.0068 0.0440
dt Decision Tree Regressor 0.0620 0.0072 0.0844 0.6836 0.0096 0.0079 0.0070
omp Orthogonal Matching Pursuit 0.0700 0.0083 0.0906 0.6352 0.0103 0.0089 0.0050
lasso Lasso Regression 0.1180 0.0229 0.1511 -0.0063 0.0171 0.0150 0.3040
en Elastic Net 0.1180 0.0229 0.1511 -0.0063 0.0171 0.0150 0.0050
llar Lasso Least Angle Regression 0.1180 0.0229 0.1511 -0.0063 0.0171 0.0150 0.0110
dummy Dummy Regressor 0.1180 0.0229 0.1511 -0.0063 0.0171 0.0150 0.0130
In [48]:
top3
Out[48]:
[BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
               compute_score=False, copy_X=True, fit_intercept=True,
               lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
               normalize=False, tol=0.001, verbose=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False),
 Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=12345, solver='auto', tol=0.001)]

HyperParameter Optimization

For the hyperparameter optimization we will use the Optuna library.
Simple linear regression does not require optimization, but Bayesian ridge regression does.

In [64]:
# Dictionary to collect the model scores

scores = {'Regressor':[], 'RMSE':[]}

Bayesian Ridge Regression

Optuna needs an objective function to optimize. For that, we define a set of hyperparameters that will be passed to the Bayesian ridge regressor; the quantity we want to minimize is the RMSE (root mean squared error).
Inside the function we train the regressor on the training set and calculate the RMSE on the test set.

In [50]:
# Function to be optimized

def br_optimizer(trial):
    
    # Hyperparameters to be optimized
    
    alpha_1 = trial.suggest_loguniform('alpha_1', 1e-11, 1e-3)
    alpha_2 = trial.suggest_loguniform('alpha_2', 1e-11, 1e-3)
    lambda_1 = trial.suggest_loguniform('lambda_1', 1e-11, 1e-3)
    lambda_2 = trial.suggest_loguniform('lambda_2', 1e-11, 1e-3)
    
    # Default variables
    
    compute_score = False
    fit_intercept = True
    tol = 1e-9
    n_iter = int(1e4)
    
    parameters = {'alpha_1':alpha_1, 'alpha_2':alpha_2, 'lambda_1':lambda_1, 'lambda_2':lambda_2, 'compute_score':compute_score, 'fit_intercept':fit_intercept, 'tol':tol, 'n_iter':n_iter}
    
    # Training of the model
    
    model = BayesianRidge(**parameters)
    model.fit(X_train, y_train)
    
    # Predictions and RMSE with the test dataset
    
    predictions = model.predict(X_test)
    test_score = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_test, predictions))]])[0][0]
    
    return test_score
In [51]:
# Optimization with Optuna library (100 iterations)

study = optuna.create_study(direction='minimize')
study.optimize(br_optimizer, n_trials=100)
[I 2023-04-10 21:14:36,559] A new study created in memory with name: no-name-af0681b9-663f-4787-90c7-843b026e8af3
[I 2023-04-10 21:14:36,572] Trial 0 finished with value: 0.05382337121625924 and parameters: {'alpha_1': 1.3140110789242623e-08, 'alpha_2': 0.0001457560069024532, 'lambda_1': 1.2228004513933517e-09, 'lambda_2': 2.075416250506302e-07}. Best is trial 0 with value: 0.05382337121625924.
[I 2023-04-10 21:14:36,581] Trial 1 finished with value: 0.053823367166067504 and parameters: {'alpha_1': 0.0006927072229209075, 'alpha_2': 0.0001103546840193058, 'lambda_1': 1.4801871924133962e-05, 'lambda_2': 3.335636002978507e-10}. Best is trial 1 with value: 0.053823367166067504.
[... 97 similar trial log lines (trials 2–98) omitted; the best value improved to 0.05382326858483766 at trial 52 ...]
[I 2023-04-10 21:14:38,733] Trial 99 finished with value: 0.053823367948883316 and parameters: {'alpha_1': 4.625521325747993e-11, 'alpha_2': 8.635922913803903e-05, 'lambda_1': 0.0001500491622757453, 'lambda_2': 5.424969168185107e-11}. Best is trial 52 with value: 0.05382326858483766.
In [52]:
# Best parameters found are stored into a dictionary

br_params = study.best_params
br_params['compute_score'] = False
br_params['fit_intercept'] = True
br_params['tol'] = 1e-9
br_params['n_iter'] = int(1e4)
pd.DataFrame(data=br_params.values(), index=br_params.keys(), columns=['Value'])
Out[52]:
Value
alpha_1 0.0
alpha_2 0.000958
lambda_1 0.00049
lambda_2 0.0
compute_score False
fit_intercept True
tol 0.0
n_iter 10000
In [53]:
# RMSE of the regressor is shown

scores['Regressor'].append('BayesianRidgeRegression')
scores['RMSE'].append(study.best_value)
pd.DataFrame(data=scores)
Out[53]:
Regressor RMSE
0 BayesianRidgeRegression 0.053823
In [54]:
# Plot of the optimization history (objective value per trial)

optuna.visualization.plot_optimization_history(study)

Train, Test and Validation of Models

In this section, the Bayesian ridge regression and the simple linear regression models are trained on the training dataset, and the RMSE and R2 values are computed on the train, test and validation datasets for each model.
A helper function simplifies the task of training and evaluating the different models.

In [6]:
# This function returns a pandas DataFrame that contains the RMSE and R2 values of a regression model

# The RMSE and R2 values are calculated for the train, test and validation datasets

def train_test_val(model, params, X_train, y_train, X_test, y_test, X_valid, y_valid):
    
    # Training of the model
    
    model = model(**params).fit(X_train, y_train)
    
    # Predictions for the train, test and validation datasets
    
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    valid_pred = model.predict(X_valid)
    
    # RMSE and R2 values for train, test and validation datasets
    
    train_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_train, train_pred))]])[0][0]
    train_R2 = r2_score(y_train, train_pred)
    
    test_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_test, test_pred))]])[0][0]
    test_R2 = r2_score(y_test, test_pred)
    
    valid_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_valid, valid_pred))]])[0][0]
    valid_R2 = r2_score(y_valid, valid_pred)
    
    # RMSE and R2 values are stored into a dictionary
    
    scores = {'Data':['Train', 'Test', 'Validation'], 'RMSE':[train_RMSE, test_RMSE, valid_RMSE], 'R2':[train_R2, test_R2, valid_R2]}
    
    # The previous dictionary is transformed into a pandas DataFrame
    
    return pd.DataFrame(data=scores)

Bayesian Ridge Regression

In [55]:
br_params
Out[55]:
{'alpha_1': 1.7326889970694058e-11,
 'alpha_2': 0.0009584447367337944,
 'lambda_1': 0.0004898774786274983,
 'lambda_2': 9.90499411629696e-11,
 'compute_score': False,
 'fit_intercept': True,
 'tol': 1e-09,
 'n_iter': 10000}

With the Bayesian ridge regression model we obtain excellent results: the RMSE for the test and validation datasets is approximately 0.053 (a very small error on the transformed target scale), and the R2 is 0.88 and 0.87 for the test and validation sets, respectively.

In [56]:
train_test_val(model=BayesianRidge, params=br_params, X_train=X_train, X_test=X_test, X_valid=X_valid, y_train=y_train, y_test=y_test, y_valid=y_valid)
Out[56]:
Data RMSE R2
0 Train 0.056251 0.876213
1 Test 0.053823 0.884018
2 Validation 0.053637 0.875380

Simple Linear Regression

The Simple Linear Regression achieved results very similar to those of the Bayesian Ridge Regression. This suggests that, after the data treatments, the relationship in the data is "simple" enough that a plain linear regression is sufficient to obtain good results.

In [57]:
fit_intercept = True
lr_params = {'fit_intercept':fit_intercept}
lr_params
Out[57]:
{'fit_intercept': True}
In [58]:
train_test_val(model=LinearRegression, params=lr_params, X_train=X_train, X_test=X_test, X_valid=X_valid, y_train=y_train, y_test=y_test, y_valid=y_valid)
Out[58]:
Data RMSE R2
0 Train 0.056225 0.876323
1 Test 0.053969 0.883406
2 Validation 0.053710 0.875049

Model Comparison

In this section we store the coefficients obtained by the Bayesian Ridge and the Simple Linear Regression models.
The magnitude of these coefficients tells us how important a feature (predictor) is for the target.
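Comparing raw coefficient magnitudes is only a fair importance measure when the predictors share a common scale, which the `StandardScaler` import at the top of the notebook suggests is the case here. The small synthetic illustration below (entirely hypothetical data, not from the house-price set) shows why standardization matters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two features on very different scales but with the same effect per standard deviation
x1 = rng.normal(0, 1, 500)
x2 = rng.normal(0, 100, 500)
y = 2.0 * x1 + 0.02 * x2 + rng.normal(0, 0.1, 500)

X = np.column_stack([x1, x2])
raw = LinearRegression().fit(X, y)
std = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print(np.abs(raw.coef_))  # raw coefficients differ by roughly 100x
print(np.abs(std.coef_))  # after standardization the magnitudes are comparable
```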

In [59]:
# Models are trained again with the train dataset

br = BayesianRidge(**br_params).fit(X_train, y_train)
lr = LinearRegression(**lr_params).fit(X_train, y_train)
In [60]:
# The names of the predictors and their respective regression coefficients are stored in arrays

# The absolute value of the coefficients is taken so their magnitudes can be compared

predictors = X_train.columns.to_numpy()
br_coefs = np.abs(br.coef_)
lr_coefs = np.abs(lr.coef_)
In [61]:
# The arrays are combined into a pandas DataFrame and the data is sorted by the magnitude of the simple linear regression coefficients

regression_coefficients = pd.DataFrame(data={'predictor':predictors, 'br_coefs':br_coefs, 'lr_coefs':lr_coefs}, columns=['predictor', 'br_coefs', 'lr_coefs'])
regression_coefficients = regression_coefficients.sort_values(by=['lr_coefs'], ascending=False)
regression_coefficients
Out[61]:
predictor br_coefs lr_coefs
14 Total_Home_Quality 0.041338 0.042565
0 YearBuilt 0.029025 0.031282
12 TotRmsAbvGrd 0.026820 0.027686
7 Total_Bathrooms 0.025816 0.026337
13 SqFtPerRoom 0.021018 0.021320
4 GarageCars 0.019344 0.019834
3 TotalBsmtSF 0.016294 0.016021
16 Attchd139 0.013292 0.013830
2 1stFlrSF 0.012858 0.012900
18 Fireplaces 0.012819 0.012486
5 FullBath 0.010895 0.012217
6 PConc123 0.007808 0.008741
15 HeatingQC 0.007811 0.007877
10 BsmtQual 0.007110 0.007102
11 KitchenQual 0.005320 0.005329
1 GarageYrBlt 0.004099 0.002574
19 MasVnrArea 0.002594 0.002161
8 YearRemodAdd 0.000692 0.001615
9 ExterQual 0.000562 0.001166
17 GarageFinish 0.000854 0.000444

The following barplots show the magnitude of each regression coefficient together with its predictor name; recall that the magnitude of a coefficient tells us the impact of that predictor on the target (house prices).
As basic intuition about house prices would suggest, the five most important features are:

  1. The total home quality
  2. The year it was built
  3. The total number of rooms above ground (above the basement)
  4. The total number of bathrooms
  5. The square footage per room
In [62]:
plt.figure(figsize=(20, 10))
plt.bar(x=regression_coefficients.predictor, height=regression_coefficients.lr_coefs)
plt.title('Simple Linear Regression Coefficients')
plt.show()
In [63]:
plt.figure(figsize=(20, 10))
plt.bar(x=regression_coefficients.predictor, height=regression_coefficients.br_coefs)
plt.title('Bayesian Ridge Regression Coefficients')
plt.show()